Course: Applied Geo-data Science at the University of Bern (Institute of Geography)

Supervisor: Prof. Dr. Benjamin Stocker

Advisers: Dr. Koen Hufkens, Pepa Aran, Pascal Schneider

Further information: https://geco-bern.github.io/agds/

Do you have questions about the workflow? Contact the Author:

Author: Bigler Patrick (patrick.bigler1@students.unibe.ch)

Matriculation number: 20-100-178

Reference: Report Exercise 6 (Chapter 9, Machine Learning 1)

1 Introduction

1.1 Objectives

This exercise is an introduction to supervised machine learning. Using a realistic problem, the following skills are trained:

  • Implementing the KNN algorithm

  • Discussing the bias-variance trade-off

  • Understanding the role of the hyper-parameter k

1.2 Theory

Supervised machine learning is a type of machine learning in which a model is trained on labeled data so that it can predict new, unseen data as accurately as possible (i.e. with the lowest error on the new data). This approach comes with challenges. For example, we need to know how good the predictions are. Therefore, we use only part of the labeled data for training; with the remaining part, we quantify the error. Here, a KNN algorithm is trained to predict new data. With the hyper-parameter k we control the flexibility of the model, comparable to the degree of a polynomial. This is not straightforward, because we do not know in advance which k is optimal.

Consider the following: with a polynomial p of sufficiently high degree \(deg(p) = n\), we can hit every target value in the training data exactly, because the polynomial can be bent, stretched, or compressed. The bias on the training data will be zero. Unfortunately, the fit would also include all errors: the target and the predictors come from measurements and always contain errors (always statistical and almost always systematic errors, i.e. measurement noise). If we applied this model to predict new data, we would project that error into the new data. This causes a higher variance in the predictions because the model is overfitted: the bias-variance trade-off shifts towards the variance.

If we use a polynomial p with \(deg(p) = 0\), we will not get a good prediction. The model describes the target in the training data so poorly that it cannot predict new data well either. The RMSE (root mean square error) is high on the training data and will also be high on the test data. We call such a model underfitted; the bias-variance trade-off shifts towards a very high bias.

The goal is to find the model with the best generalisability, i.e. the model with the best bias-variance trade-off. This exercise shows how we can find such a model with the KNN algorithm.
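The over- and underfitting described above can be illustrated with a small, self-contained sketch (not part of the exercise workflow; the data and polynomial degrees are made up): a constant model with \(deg(p) = 0\) and a very flexible polynomial are both fit to noisy training data, and their RMSE is compared on held-out test data.

```r
# Illustrative only: compare an underfitted and an overfitted polynomial.
set.seed(123)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.3)   # noisy measurements of a smooth signal

idx   <- sample(seq_along(x), 70)    # 70 % training, 30 % test
train <- data.frame(x = x[idx],  y = y[idx])
test  <- data.frame(x = x[-idx], y = y[-idx])

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

mod.under <- lm(y ~ 1, data = train)            # deg(p) = 0: underfitted
mod.over  <- lm(y ~ poly(x, 20), data = train)  # very flexible: overfitted

# Typically, the underfitted model has a high RMSE on both sets (high bias),
# while the overfitted model has a low training RMSE but a higher test RMSE.
rmse(train$y, fitted(mod.under)); rmse(test$y, predict(mod.under, test))
rmse(train$y, fitted(mod.over));  rmse(test$y, predict(mod.over,  test))
```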

2 Methods

We use the KNN (k-nearest neighbors) algorithm and compare it to a multivariate linear regression model. For the KNN, we split our data: 70% for training and 30% for evaluating the model. Because the algorithm is distance-based, we have to standardize our data. With the Box-Cox transformation, we bring the data closer to a normal distribution. We use set.seed() for reproducibility.

2.1 R: Evaluation of the data

The open-source IDE RStudio (version 2022.12.0+353) was used for all analyses. Data handling was done in the R language. We use the R package "tidyverse" and other packages to process the data effectively.

3 Programming and data evaluation

3.1 Packages

The following code chunk loads all packages we need. The package "conflicted" is important: when functions from different packages share the same name but do not do the same thing, it forces us to set an explicit preference instead of silently using one of them.

source("../../R/general/packages.R")

3.2 Read the file

First, we download the data and save it in our repository. We must do this only once: if the file already exists, we read it directly. That is why we implemented an if-else statement.

name.of.file <- "../../data/re_ml_01/daily_fluxes.csv"

# If no such file exists yet, create it
if (!file.exists(name.of.file)){
  
  # URL of the raw data (built with paste0 so the string stays intact)
  url <- paste0("https://raw.githubusercontent.com/geco-bern/agds/main/data/",
                "FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv")
  
  # Read the data directly from the URL
  daily_fluxes <- read.table(url, header = TRUE, sep = ",")
  
  # Write a CSV file into the repository
  write_csv(daily_fluxes, name.of.file)
}

# Read the (now existing) file
daily_fluxes.raw <- read_csv(name.of.file)

3.3 Data Overview

3.3.1 Missing values

If we take a closer look at the file, we see that there are 334 variables. We cannot continue with our usual procedure because there are too many variables. That is why we have to clean the data first.

3.3.2 Data cleaning

We select only the columns we need. Further, some columns contain -9999 as a value; our quality function replaces that with NA. Then we use ymd() from the lubridate package to parse the date properly. Further, we want only columns of good quality. For that, we check selected columns against their quality-control columns: if the proportion of good measurements is less than 80%, we overwrite the value with NA (and do not drop the row!). After that, we discard all quality-control columns. Now we can see that the variable LW_IN_F (long-wave radiation) contains about 56% NAs. This is very problematic because for the model calculation we drop all rows that contain NAs, so we would lose about 56% of our data. Further, the variable TIMESTAMP contains the date, but we are not sure whether a time variable can be standardized with Box-Cox. That is why we do not use these two variables. All other variables are of good quality.

Attention: unlike here, the variables PA_F and P_F are not used in the AGDS script. Therefore, the results differ noticeably, for example in RMSE, RSQ, and the optimal k.
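The cleaning itself happens in the sourced function below. As an orientation, a minimal sketch of the steps just described (column selection, date parsing, -9999 to NA, masking by the quality-control fraction) could look roughly like this; the column names and the exact selection are illustrative, not the sourced implementation:

```r
library(dplyr)
library(lubridate)

clean.sketch <- function(df) {
  df |>
    # keep only the variables of interest plus their QC columns
    select(TIMESTAMP, GPP_NT_VUT_REF, starts_with("SW_IN_F"),
           starts_with("VPD_F"), starts_with("TA_F")) |>
    mutate(
      TIMESTAMP = ymd(TIMESTAMP),                       # parse the date
      across(where(is.numeric), \(x) na_if(x, -9999)),  # -9999 -> NA
      # mask values with a good-quality fraction below 80 % (keep the row!)
      SW_IN_F = ifelse(SW_IN_F_QC < 0.8, NA, SW_IN_F)
    ) |>
    select(-ends_with("_QC"))                           # drop QC columns
}
```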

# Load the function in the file
source("../../R/re_ml_01/function.use.good.quality.only.R")

# Function call to clean the data
daily_fluxes <- use.good.quality.only(daily_fluxes.raw)

# Load the function
source("../../R/general/function.visualize.na.values.R")

# Function call
visualize.na.values.without.groups(daily_fluxes)

3.4 Implementation of the KNN algorithm

3.4.1 Split the data

Here we split our data into a training subset (70%) and a test subset (30%). After the split, we can calculate our models. We use all variables except TIMESTAMP and LW_IN_F. For KNN, we use k = 8 as a first try. This choice is arbitrary; the last code chunk in this workflow will show you which k is the optimal one (we do not want to spoil anything here). That is why you should use k = 8 first.

# For reproducibility (pseudo-random)
set.seed(123)  
# Split 70 % to 30 % 
split <- rsample::initial_split(daily_fluxes, prop = 0.7, strata = "VPD_F")

# Split the data in a training set
daily_fluxes_train <- rsample::training(split)
# And a test set
daily_fluxes_test <- rsample::testing(split)

# Load the function in the file.
source("../../R/re_ml_01/function.train.and.test.R")

# Function call for KNN with k = 8
mod.knn <- knn.model(daily_fluxes_train, 8)

# Function call for lm
mod.lm <- lm.model(daily_fluxes_train)
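The sourced functions are not shown here. A sketch of what a knn.model()-style function could look like, following the caret + recipes approach used in the AGDS course (the formula and the exact pre-processing steps are assumptions, not the sourced code):

```r
library(caret)
library(recipes)

knn.model.sketch <- function(df_train, k) {
  # pre-processing: Box-Cox, then centering and scaling (standardization)
  pp <- recipe(GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F,
               data = df_train |> tidyr::drop_na()) |>
    step_BoxCox(all_predictors()) |>
    step_center(all_numeric(), -all_outcomes()) |>
    step_scale(all_numeric(), -all_outcomes())

  # fit KNN with a fixed k (no resampling)
  caret::train(pp,
               data      = df_train |> tidyr::drop_na(),
               method    = "knn",
               trControl = caret::trainControl(method = "none"),
               tuneGrid  = data.frame(k = k),
               metric    = "RMSE")
}
```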

3.4.2 Evaluation

After splitting the data, we have to evaluate the models. The first function call is for the lm model; the function takes the model calculated in the code chunk above. This is a multivariate regression model with a fixed formula, so there is no hyper-parameter to tune.

The second function call is for the KNN model, which was calculated with k = 8 in the code chunk above. If we want a model with a different k, we have to recalculate the KNN model with the function above.
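The sourced eval_model() produces the figures discussed below. The core metrics it reports (RMSE, RSQ, MAE) could be computed, for example, with the {yardstick} package; this is a sketch, and the target name is an assumption:

```r
library(yardstick)

metrics.sketch <- function(mod, df) {
  df <- df |> tidyr::drop_na()
  df$fitted <- predict(mod, newdata = df)
  tibble::tibble(
    rmse = rmse_vec(df$GPP_NT_VUT_REF, df$fitted),  # root mean square error
    rsq  = rsq_vec(df$GPP_NT_VUT_REF,  df$fitted),  # variance explained
    mae  = mae_vec(df$GPP_NT_VUT_REF,  df$fitted)   # mean absolute error
  )
}

# Usage (sketch): metrics.sketch(mod.knn, daily_fluxes_test)
```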

lm-Model

The orange line is the 1:1 line (a 45° angle). If the model predicted all values perfectly, all points would lie on this line, because each fitted value would equal its target.

The red line is the fitted regression line. It has a smaller slope than the orange line, which means our model systematically underestimates the values. It also shows that high values are underestimated more strongly than small ones.

knn-Model

The situation for KNN is very similar. Its regression line is slightly steeper than that of the lm model, which means that KNN underestimates the values less than the lm model. But KNN, too, underestimates high values more than small ones.

# Load the function
source("../../R/re_ml_01/function.evaluation.model.R")

# Function call for linear regression model
eval_model(mod = mod.lm, df_train = daily_fluxes_train, df_test = daily_fluxes_test, c("Davos (lm)"), c("Davos (lm)"))

# Function call for KNN
eval_model(mod = mod.knn, df_train = daily_fluxes_train, df_test = daily_fluxes_test, c("Davos (knn: k=8)"), c("Davos (knn: k=8)"))

3.4.3 Visualization

Here you can see how the KNN model depends on k. If we use k = 1, we get a perfect fit on the training data (perfect prediction: the regression line has a 45° angle, all values lie on it, and the bias is zero). But the RMSE on the test data is high (the model is overfitted). If we used k = n, our model would be underfitted, and the RMSE on the test data would also be high. For a visualization of RSQ and RMSE as a function of k, please take a look at the last figure.

# Load the function into the file
source("../../R/re_ml_01/function.different.k.R")

# Define a sequence for k
my.sequence <- c(1, 8, 15, 50, 100)
# Function call
different.k(my.sequence, daily_fluxes_train, daily_fluxes_test, c("Davos"), c("Davos"))

## k-Nearest Neighbors 
## 
## 1910 samples
##    8 predictor
## 
## Recipe steps: BoxCox, center, scale 
## Resampling: None

4 Discussion

In this short discussion, we want to show you two main aspects of this exercise and briefly discuss each of them. First, we need to know how to judge the models: why can one model be considered better than another, and how can we estimate the bias-variance trade-off? Second, we need to know how the KNN algorithm works, which requires understanding the role of the hyper-parameter.

4.1 Model discussion and the role of the bias-variance trade-off

4.1.0.1 Why is the difference between the evaluation on the training and the test set larger for the KNN model than for the linear regression model?

To answer this question, we have to take a closer look at the formulas:

An lm model uses all predictors to calculate a multivariate regression. However, it is a linear regression, meaning the polynomial has deg(polynomial) = 1 in each predictor. Using this equation, we make some assumptions (linearity, normality, homoscedasticity, and independence). If our data have an annual cycle, we will never get a good fit, because the linearity assumption is violated. We must consider that.

\[ \hat{y}=\hat{a}_0 +\sum_{n=1}^{N}\hat{a}_n \cdot x_n \]

Obviously, an lm model does not look for any underlying pattern; it simply creates a linear combination of all available variables. Further, RSQ = 1 is only possible if the target is an exact linear combination of the predictors. Therefore, the lm model will always have a bias. It follows that the lm model has a higher RMSE on the training data (e.g. because linearity is violated), and therefore the RMSE of the prediction will be high as well.

The situation for KNN is different. We could calculate a polynomial regression with the following formula:

\[ \hat{y}=\sum_{n=0}^{N}\hat{a}_n \cdot x^n \]

KNN does not necessarily use all predictors. It uses the k nearest neighbours to calculate a local estimate, comparable to a local polynomial regression. The KNN model can thus follow an underlying pattern. That is why the RMSE on the training data is lower than for the lm model. If we want to make a prediction, KNN has an advantage over lm: it "knows" the underlying pattern, and RMSE and MAE will not increase as fast as for the lm model.
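The averaging idea can be made concrete with a tiny base-R sketch (one standardized predictor, purely illustrative): the prediction for a new point is the mean target value of its k nearest training points.

```r
knn.predict.1d <- function(x_train, y_train, x_new, k) {
  sapply(x_new, function(x0) {
    nn <- order(abs(x_train - x0))[seq_len(k)]  # indices of the k nearest
    mean(y_train[nn])                           # average their targets
  })
}

# The two nearest neighbours of x = 2.4 are x = 2 and x = 3,
# so the prediction is mean(c(4, 6)) = 5.
knn.predict.1d(c(1, 2, 3, 10), c(2, 4, 6, 20), x_new = 2.4, k = 2)
```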

4.1.0.2 Why does the evaluation on the test set indicate a better model performance of the KNN model than the linear regression model?

We also see that KNN has a higher RSQ than lm (0.64 vs. 0.62) and a lower MAE (1.55 vs. 1.63). It follows that KNN is clearly the better model here. But be careful: the RMSE and RSQ of the KNN model depend directly on k, so we must optimize k first!

4.1.0.3 How would you position the KNN and the linear regression model along the spectrum of the bias-variance trade-off?

The lm model tends to be underfitted and thus has a high bias: the trade-off shifts towards the bias.

We are not sure whether we understand the question correctly. For the KNN model with k = 8, the RMSE is lower than for the lm model. But we know that the optimal k is 15, which means the model with k = 8 is slightly overfitted: the trade-off shifts towards the variance. This shift is small, so KNN with k = 8 still has a better trade-off than lm, but it is not better than KNN with k = 15. KNN models with k = 50 and k = 100 are underfitted; their trade-off shifts towards the bias.

4.1.0.4 Visualize temporal variations of observed and modeled GPP for both models, covering all available dates.

lm-model

First, there is an annual cycle. We also see that the quality of the prediction varies: some years have smaller residuals than others. The model underestimates the target in one season and overestimates it in the opposite season. To see more, we plot the residuals as a function of the day of the year. Now we see that our lm model underestimates the target in winter and overestimates it in summer. Because the target is GPP (gross primary productivity), this makes sense.

knn-model

We do the same for the KNN model. The result is very similar to the lm model. For KNN, the bias is lower, but the model also underestimates the target in winter and overestimates the values during summer.

# Load the function into the file
source("../../R/re_ml_02/function.vis.residue.R")

# Residuals for Davos (lm): annual resolution
time.variation.year(mod.lm, daily_fluxes_test,
                    c("multiple linear regression (lm)"), 
                    c("re_ml_1 (Chapter 9)"))

# Residuals for Davos (lm): daily resolution (day of the year)
time.variation.year(mod.lm, daily_fluxes_test,   
                    c("multiple linear regression (lm)"), 
                    c("re_ml_1 (Chapter 9)"), annual = FALSE)

# Residuals for Davos (KNN): annual resolution
time.variation.year(mod.knn, daily_fluxes_test,
                    c("KNN with: Train = Davos, Test = Davos, k = 8"), 
                    c("re_ml_1 (Chapter 9)"))

# Residuals for Davos (KNN): daily resolution (day of the year)
time.variation.year(mod.knn, daily_fluxes_test,   
                    c("KNN with: Train = Davos, Test = Davos, k = 8"), 
                    c("re_ml_1 (Chapter 9)"), annual = FALSE)

4.2 Discuss the role of k and the generalisability of a model

The k in KNN is a hyper-parameter that refers to the number of nearest neighbours considered. To quantify the "error", we use the MAE instead of the RMSE. Although the values differ, the two curves run in parallel: both MAE and RMSE are U-shaped as a function of k and have their global minima at the same k.

Hypothesis:

“The lower we choose k, the higher the RSQ on the training data, but the higher the variance on the test data.”

Explanation: Let k = 1. Then the training data would be predicted perfectly, because only one (the nearest) neighbour would be taken into account. We would hit every target value; the bias would be zero, so RSQ = 1 and MAE = 0. But the model would be totally overfitted. That is the bias-variance trade-off: if the bias on the training data is low, the variance on the test data is, in general, high.

Let k = N. Then it would follow: \(RSQ\in(0,1]\) and \(MAE\in[0, \infty)\), but we do not know where exactly they lie. However, the model tends to be underfitted, because it takes all neighbours into account: the prediction would simply be the mean of all training observations.

Conclusion: ideally, we want a model with zero bias and zero variance. Since this is impossible in practice, we have to find the model with the best bias-variance trade-off.
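The search for this trade-off, which the sourced function below performs, boils down to the following loop (a sketch reusing the already sourced knn.model(); the prediction and metric details are assumptions):

```r
# Refit the KNN for every candidate k and record the MAE on the test set.
candidate.k <- c(1, 2, 3, 4, seq(5, 100, by = 5))

test.clean <- daily_fluxes_test |> tidyr::drop_na()

mae.test <- sapply(candidate.k, function(k) {
  mod  <- knn.model(daily_fluxes_train, k)
  pred <- predict(mod, newdata = test.clean)
  mean(abs(test.clean$GPP_NT_VUT_REF - pred))   # mean absolute error
})

# The k with the lowest test MAE gives the best trade-off found
candidate.k[which.min(mae.test)]
```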

4.2.0.1 Showing model generalisability as a function of model complexity and optimal k

The hypothesis is not entirely true. RSQ is a quantity that describes the variance, or rather, how much of the variance the model explains. For the best bias-variance trade-off, we want the k where the error on the test data is lowest (low MAE).

The model for k = 1 is totally overfitted. On the training data, RSQ is 1, so the bias must be 0. But on the test data, the MAE is at its global maximum. The model is useless for any predictions; the trade-off shifts towards too much variance.

If we slowly increase k, the MAE increases and the RSQ decreases on the training set. On the test data, RSQ increases and MAE decreases. If we exceed the optimal k, RSQ decreases and MAE increases on the test set. We see that there must be an optimal k.

For k = 100, the RSQ is lowest on the training data. The MAE has increased for both data sets but is not at a maximum. The model is not suitable for predictions, because the trade-off is on the side of too much bias.

Long story short: if k is to the left of the optimal k, the model is overfitted; if k is to the right of the optimal k, it is underfitted.

The optimal k is where the trade-off is best. Generalisability means that we want the model with the lowest MAE on the test data (i.e. the lowest error) and the highest RSQ (most variance explained). For that, we mark the minimum of the test-data error curve (green point at k = 15). Unfortunately, our coarse sequence only shows that the optimum lies between 10 and 20. If we make another function call with all k between 10 and 20, we see that the optimal k is 15. We choose this hyper-parameter to calculate our model.

Important: here we set a seed and split our data only pseudo-randomly. Without set.seed(123), the optimal k could be different.

# Load the function into the file
source("../../R/re_ml_01/function.parameter.extracter.R")

# Define a sequence for k. Use 1,2,3,4 to show the curve at the beginning
my.sequence <- c(1, 2, 3, 4, seq(5, to = 100, by = 5))

# Visualize the MAE and RSQ --> optimal k is 15
parameter.extracter(my.sequence, daily_fluxes_train, daily_fluxes_test, 
                    c("re_ml_01 (Chapter 9)"))